Abstract

When applying aggregating strategies to Prediction with Expert Advice,
the learning rate must be adaptively tuned. The natural choice of
$\sqrt{\text{complexity}/\text{current loss}}$ renders the analysis of Weighted
Majority derivatives quite complicated. In particular, for arbitrary weights there
have been no results proven so far. The analysis of the alternative "Follow the
Perturbed Leader" (FPL) algorithm from Kalai and Vempala (2003), based on Hannan's
algorithm, is easier. We derive loss bounds for adaptive learning rate and
both finite expert classes with uniform weights and countable expert classes
with arbitrary weights. For the former setup, our loss bounds match the best
known results so far, while for the latter our results are new.
1 Introduction

In Prediction with Expert Advice (PEA) one considers an ensemble of sequential 
predictors (experts). A master algorithm is constructed based on the historical 
performance of the predictors. The goal of the master algorithm is to perform 
nearly as well as the best expert in the class, on any sequence of outcomes. This is 
achieved by making (randomized) predictions close to the better experts. 

PEA theory has rapidly developed in the recent past. Starting with the Weighted
Majority (WM) algorithm of [LW89, LW94] and the aggregating strategy of [Vov90],
a vast variety of different algorithms and variants have been published. A key
parameter in all these algorithms is the learning rate. While this parameter had to be
fixed in the early algorithms such as WM, [CB97] established the so-called doubling
trick to make the learning rate coarsely adaptive. A little later, incrementally
adaptive algorithms were developed by [AG00, ACBG02, YEYS04, Gen03], and others.
In Section 10, we will compare our results with these works in more detail.
Unfortunately, the loss bound proofs for the incrementally adaptive WM variants are quite
complex and technical, despite the typically simple and elegant proofs for a static
learning rate.

The growing complexity of the proof techniques also had another consequence: While for
the original WM algorithm, assertions are proven for countable classes of experts
with arbitrary weights, the modern variants usually restrict to finite classes with
uniform weights (an exception being [Gen03], see the discussion section). This
might be sufficient for many practical purposes but it prevents the application to
more general classes of predictors. Examples are extrapolating (=predicting) data
points with the help of a polynomial (=expert) of degree $d = 1,2,3,\ldots$, or the (from a
computational point of view largest) class of all computable predictors. Furthermore,
most authors have concentrated on predicting binary sequences, often with the 0/1
loss for {0,1}-valued and the absolute loss for [0,1]-valued predictions. Arbitrary
losses are less common. Nevertheless, it is easy to abstract completely from the
predictions and consider the resulting losses only. Instead of predicting according
to a "weighted majority" in each time step, one chooses one single expert with a
probability depending on his past cumulated loss. This is done e.g. by [FS97], where
an elegant WM variant, the Hedge algorithm, is analyzed.

A different, general approach to achieve similar results is "Follow the Perturbed
Leader" (FPL). The principle dates back to as early as 1957, now called Hannan's
algorithm [Han57]. In 2003, Kalai and Vempala published a simpler proof of the
main result of Hannan and also succeeded in improving the bound by modifying the
distribution of the perturbation. The resulting algorithm (which they call FPL*)
has the same performance guarantees as the WM-type algorithms for fixed learning
rate, save for a factor of $\sqrt{2}$. A major advantage we will discover in this work is
that its analysis remains easy for an adaptive learning rate, in contrast to the WM
derivatives. Moreover, it generalizes to online decision problems other than PEA.

In this work, we study the FPL algorithm for PEA. The problems of WM algorithms
mentioned above are addressed: Bounds on the cumulative regret of the standard
form $\sqrt{kL}$ (where $k$ is the complexity and $L$ is the cumulative loss of the best
expert in hindsight) are shown for countable expert classes with arbitrary weights,
adaptive learning rate, and arbitrary losses. Regarding the adaptive learning rate,
we obtain proofs that are simpler and more elegant than for the corresponding WM
algorithms. (In particular, the proof for a self-confident choice of the learning rate,
Theorem 7, is less than half a page.) Further, we prove the first loss bounds for
arbitrary weights and adaptive learning rate. In order to obtain the optimal $\sqrt{kL}$
bound in this case, we will need to introduce a hierarchical version of FPL, while
without hierarchy we show a worse bound $k\sqrt{L}$. (For self-confident learning rate
together with uniform weights and arbitrary losses, one can prove corresponding
results for a variant of WM by adapting an argument by [ACBG02].)

PEA usually refers to an online worst case setting: $n$ experts that deliver
sequential predictions over a time range $t = 1,\ldots,T$ are given. At each time $t$, we know
the actual predictions and the past losses. The goal is to give a prediction such
that the overall loss after $T$ steps is "not much worse" than the best expert's loss
on any sequence of outcomes. If the prediction is deterministic, then an adversary
could choose a sequence which provokes maximal loss. So we have to randomize our
predictions. Consequently, we ask for a prediction strategy such that the expected
loss on any sequence is small.

This paper is structured as follows. In Section 2 we give the basic definitions. 
While [KV03] consider general online decision problems in finite-dimensional spaces, 
we focus on online prediction tasks based on a countable number of experts. Like 
[KV03] we exploit the infeasible FPL predictor (IFPL) in our analysis. Sections 3 
and 4 derive the main analysis tools. In Section 3 we generalize (and marginally 
improve) the upper bound [KV03, Lem.3] on IFPL to arbitrary weights. The main 
difficulty we faced was to appropriately distribute the weights to the various terms. 
For the corresponding lower bound (Section 7) this is an open problem. In Section 4 
we exploit our restricted setup to significantly improve [KV03, Eq.(3)] allowing for 
bounds logarithmic rather than linear in the number of experts. The upper and lower 
bounds on IFPL are combined to derive various regret bounds on FPL in Section 5. 
Bounds for static and dynamic learning rate in terms of the sequence length follow
straightforwardly. The proof of our main bound in terms of the loss is much more
elegant than the analysis of previous comparable results. Section 6 proposes a novel 
hierarchical procedure to improve the bounds for non-uniform weights. In Section 7, 
a lower bound is established. In Section 8, we consider the case of independent 
randomization more seriously. In particular, we show that the derived bounds also 
hold for an adaptive adversary. Section 9 treats some additional issues, including 
bounds with high probability, computational aspects, deterministic predictors, and 
the absolute loss. Finally, in Section 10 we discuss our results, compare them to 
references, and state some open problems. 

2 Setup and Notation 

Setup. Prediction with Expert Advice proceeds as follows. We are asked to perform
sequential predictions $y_t \in \mathcal{Y}$ at times $t = 1,2,\ldots$. At each time step $t$, we have access
to the predictions $(y_t^i)_{1\le i\le n}$ of $n$ experts $\{e_1,\ldots,e_n\}$, where the size of the expert pool
is $n \in \mathbb{N}\cup\{\infty\}$. It is convenient to use the same notation for finite ($n \in \mathbb{N}$) and
countably infinite ($n = \infty$) expert pool. After having made a prediction, we make
some observation $x_t \in \mathcal{X}$, and a Loss is revealed for our and each expert's prediction.
(E.g. the loss might be 1 if the expert made an erroneous prediction and 0 otherwise.
This is the 0/1 loss.) Our goal is to achieve a total loss "not much worse" than the
best expert, after $t$ time steps.

We admit $n \in \mathbb{N}\cup\{\infty\}$ experts, each of which is assigned a known complexity
$k^i \ge 0$. Usually we require $\sum_i e^{-k^i} \le 1$, which implies that the $k^i$ are valid lengths
of prefix code words, for instance $k^i = \ln n$ if $n < \infty$ or $k^i = \frac{1}{2} + 2\ln i$ if $n = \infty$. Each
complexity defines a weight by means of $e^{-k^i}$ and vice versa. In the following we
will talk of complexities rather than of weights. If $n$ is finite, then usually one sets
$k^i = \ln n$ for all $i$; this is the case of uniform complexities/weights. If the set of
experts is countably infinite ($n = \infty$), uniform complexities are not possible. The
vector of all complexities is denoted by $k = (k^i)_{1\le i\le n}$. At each time $t$, each expert $i$
suffers a loss$^1$ $s_t^i = \mathrm{Loss}(x_t, y_t^i) \in [0,1]$, and $s_t = (s_t^i)_{1\le i\le n}$ is the vector of all losses at
time $t$. Let $s_{<t} = s_1 + \ldots + s_{t-1}$ (respectively $s_{1:t} = s_1 + \ldots + s_t$) be the total past loss
vector (including current loss $s_t$) and $s_{1:t}^{min} = \min_i\{s_{1:t}^i\}$ be the loss of the best expert
in hindsight (BEH). Usually we do not know in advance the time $t > 0$ at which the
performance of our predictions is evaluated.
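The two example complexity assignments can be checked numerically against the constraint $\sum_i e^{-k^i} \le 1$. The following is a minimal sketch; the helper names (`kraft_sum`, `uniform_complexities`, `countable_complexities`) and the truncation length are illustrative assumptions, not part of the paper.

```python
import math

def kraft_sum(complexities):
    """Return sum_i exp(-k^i) for an iterable of complexities."""
    return math.fsum(math.exp(-k) for k in complexities)

def uniform_complexities(n):
    """k^i = ln n for each of n experts (finite class, uniform weights)."""
    return [math.log(n)] * n

def countable_complexities(num_terms):
    """First num_terms values of k^i = 1/2 + 2 ln i, i = 1, 2, ... (a truncation)."""
    return [0.5 + 2.0 * math.log(i) for i in range(1, num_terms + 1)]

uniform_total = kraft_sum(uniform_complexities(16))
truncated_total = kraft_sum(countable_complexities(100_000))
# Value of the full infinite sum: exp(-1/2) * sum_i i^(-2) = exp(-1/2) * pi^2 / 6
limit_total = math.exp(-0.5) * math.pi ** 2 / 6
```

For uniform complexities the sum equals 1 up to rounding, while for $k^i = \frac{1}{2} + 2\ln i$ the infinite sum stays strictly below 1.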

$^1$ The setup, analysis and results easily scale to $s_t^i \in [0,S]$ for $S > 0$ other than 1.

General decision spaces. The setup can be generalized as follows. Let $\mathcal{S} \subseteq \mathbb{R}^n$ be
the state space and $\mathcal{D} \subseteq \mathbb{R}^n$ the decision space. At time $t$ the state is $s_t \in \mathcal{S}$, and a
decision $d_t \in \mathcal{D}$ (which is made before the state is revealed) incurs a loss $d_t \circ s_t$, where
"$\circ$" denotes the inner product. This implies that the loss function is linear in the
states. Conversely, each linear loss function can be represented in this way. The
decision which minimizes the loss in state $s \in \mathcal{S}$ is

    $$M(s) := \arg\min_{d\in\mathcal{D}} \{d \circ s\} \qquad (1)$$


if the minimum exists. The application of this general framework to PEA is
straightforward: $\mathcal{D}$ is identified with the space of all unit vectors $\mathcal{E} = \{e_i : 1 \le i \le n\}$, since a
decision consists of selecting a single expert, and $s_t \in [0,1]^n$, so states are identified
with losses. Only Theorems 2 and 10 will be stated in terms of general decision
space. Our main focus is $\mathcal{D} = \mathcal{E}$. (Even for this special case, the scalar product
notation is not too heavy, but will turn out to be convenient.) All our results generalize
to the simplex $\mathcal{D} = \Delta = \{v \in [0,1]^n : \sum_i v^i = 1\}$, since the minimum of a linear function
on $\Delta$ is always attained on $\mathcal{E}$.

Follow the Perturbed Leader. Given $s_{<t}$ at time $t$, an immediate idea to solve the
expert problem is to "Follow the Leader" (FL), i.e. selecting the expert which
performed best in the past (minimizes $s_{<t}^i$), that is predict according to expert $M(s_{<t})$.
This approach fails for two reasons. First, for $n = \infty$ the minimum in (1) may not
exist. Second, for $n = 2$ and $s = \binom{0\ \ 1\ \ 0\ \ 1\ \ 0\ \ 1\ \ldots}{\frac{1}{2}\ \ 0\ \ 1\ \ 0\ \ 1\ \ 0\ \ldots}$, FL always chooses the wrong prediction
[KV03]. We solve the first problem by penalizing each expert by its complexity, i.e.
predicting according to expert $M(s_{<t} + k)$. The FPL (Follow the Perturbed Leader)
approach solves the second problem by adding to each expert's loss $s_{<t}^i$ a random
perturbation. We choose this perturbation to be negative exponentially distributed,
either independent in each time step or once and for all at the very beginning at
time $t = 0$. The former choice is preferable in order to protect against an adaptive
adversary who generates the $s_t$, and in order to get bounds with high probability
(Section 9). For the main analysis however, the latter choice is more convenient.
Due to linearity of expectations, these two possibilities are equivalent when dealing
with expected losses (this is straightforward for oblivious adversary, for adaptive
adversary see Section 8), so we can henceforth assume without loss of generality one
initial perturbation $q$.
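The second failure mode of FL can be reproduced in a few lines. This is a sketch under the assumption that ties are broken in favor of the lowest index (the paper does not fix a tie-breaking rule); the function names are hypothetical. Expert 0 suffers losses 0,1,0,1,... and expert 1 suffers 1/2,0,1,0,1,....

```python
import numpy as np

def follow_the_leader_loss(losses):
    """Total loss of FL on a (T, n) loss array; ties go to the lowest index."""
    cumulative = np.zeros(losses.shape[1])
    total = 0.0
    for s_t in losses:
        leader = int(np.argmin(cumulative))  # M(s_{<t}) over unit vectors
        total += s_t[leader]
        cumulative += s_t
    return total

def alternating_losses(T):
    """Expert 0 suffers 0,1,0,1,...; expert 1 suffers 1/2,0,1,0,1,..."""
    losses = np.zeros((T, 2))
    losses[1::2, 0] = 1.0   # expert 0: loss 1 at t = 2, 4, ...
    losses[0, 1] = 0.5      # expert 1: loss 1/2 at t = 1
    losses[2::2, 1] = 1.0   # expert 1: loss 1 at t = 3, 5, ...
    return losses

T = 100
losses = alternating_losses(T)
fl_total = follow_the_leader_loss(losses)
best_total = losses.sum(axis=0).min()
```

After the first step FL suffers loss 1 in every round, so its total loss grows like $T$, while the best expert's loss grows like $T/2$.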

The FPL algorithm is defined as follows:

Choose random vector $q \sim \exp$, i.e. $P[q^1 \ldots q^n] = e^{-q^1} \cdot \ldots \cdot e^{-q^n}$ for $q \ge 0$.
For $t = 1,\ldots,T$

- Choose learning rate $\eta_t$.

- Output prediction of expert $i$ which minimizes $s_{<t}^i + (k^i - q^i)/\eta_t$.

- Receive loss $s_t^i$ for all experts $i$.
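The algorithm above can be sketched as follows. This is a minimal illustration with one initial perturbation, not the authors' implementation; the function name `fpl_run`, the callable `learning_rate`, and the example schedule $\eta_t = 1/\sqrt{t}$ are assumptions made for the sketch.

```python
import numpy as np

def fpl_run(losses, complexities, learning_rate, rng):
    """Run FPL with one initial perturbation q on a (T, n) array of losses in [0, 1].

    learning_rate(t) returns eta_t > 0 for t = 1, ..., T.
    Returns the list of chosen expert indices and the total loss suffered.
    """
    T, n = losses.shape
    q = rng.exponential(scale=1.0, size=n)   # P[q^i] = exp(-q^i), drawn once
    cumulative = np.zeros(n)                 # s_{<t}
    choices, total = [], 0.0
    for t in range(1, T + 1):
        eta_t = learning_rate(t)
        scores = cumulative + (complexities - q) / eta_t
        i_t = int(np.argmin(scores))         # perturbed, penalized leader
        choices.append(i_t)
        total += losses[t - 1, i_t]
        cumulative += losses[t - 1]          # losses are revealed after the choice
    return choices, total

rng = np.random.default_rng(0)
n, T = 4, 50
losses = rng.uniform(size=(T, n))
k = np.full(n, np.log(n))                    # uniform complexities k^i = ln n
choices, total = fpl_run(losses, k, lambda t: 1.0 / np.sqrt(t), rng)
```

Only the loss vectors enter the computation; the experts' actual predictions are never needed.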

Other than $s_{<t}$, $k$ and $q$, FPL depends on the learning rate $\eta_t$. We will give choices
for $\eta_t$ in Section 5, after having established the main tools for the analysis. The
expected loss at time $t$ of FPL is $\ell_t := E[M(s_{<t} + \frac{k-q}{\eta_t}) \circ s_t]$. The key idea in the FPL
analysis is the use of an intermediate predictor IFPL (for Implicit or Infeasible FPL).
IFPL predicts according to $M(s_{1:t} + \frac{k-q}{\eta_t})$, thus under the knowledge of $s_t$ (which is of
course not available in reality). By $r_t := E[M(s_{1:t} + \frac{k-q}{\eta_t}) \circ s_t]$ we denote the expected
loss of IFPL at time $t$. The losses of IFPL will be upper-bounded by BEH in
Section 3 and lower-bounded by FPL in Section 4. Note that our definition of the
FPL algorithm deviates from that of [KV03]. It uses an exponentially distributed
perturbation similar to their FPL* but one-sided and a non-stationary learning rate
like Hannan's algorithm.

Notes. Observe that we have stated the FPL algorithm regardless of the actual 
predictions of the experts and possible observations, only the losses are relevant. 
Note also that an expert can implement a highly complicated strategy depending 
on past outcomes, despite its trivializing identification with a constant unit vector. 
The complex expert's (and environment's) behavior is summarized and hidden in the 
state vector $s_t = \mathrm{Loss}(x_t, y_t^i)_{1\le i\le n}$. Our results therefore apply to arbitrary prediction
and observation spaces $\mathcal{Y}$ and $\mathcal{X}$ and arbitrary bounded loss functions. This is in
contrast to the major part of PEA work developed for binary alphabet and 0/1 or 
absolute loss only. Finally note that the setup allows for losses generated by an 
adversary who tries to maximize the regret of FPL and knows the FPL algorithm 
and all experts' past predictions/losses. If the adversary also has access to FPL's 
past decisions, then FPL must use independent randomization at each time step in 
order to achieve good regret bounds. 

Motivation of FPL. Let $d(s_{<t})$ be any predictor with decision based on $s_{<t}$. The
following identity is easy to show:

    $$\underbrace{\sum_{t=1}^T d(s_{<t}) \circ s_t}_{\text{"FPL"}} = \underbrace{d(s_{1:T}) \circ s_{1:T}}_{\text{"BEH"}} + \underbrace{\overbrace{\sum_{t=1}^T [d(s_{<t}) - d(s_{1:t})] \circ s_{<t}}^{\le 0 \text{ if } d \approx M}}_{\text{"IFPL-BEH"}} + \underbrace{\overbrace{\sum_{t=1}^T [d(s_{<t}) - d(s_{1:t})] \circ s_t}^{\text{small if } d(\cdot) \text{ is continuous}}}_{\text{"FPL-IFPL"}} \qquad (2)$$

For a good bound of FPL in terms of BEH we need the first term on the r.h.s.
to be close to BEH and the last two terms to be small. The first term is close to
BEH if $d \approx M$. The second to last term is even negative if $d = M$, hence small
if $d \approx M$. The last term is small if $d(s_{<t}) \approx d(s_{1:t})$, which is the case if $d(\cdot)$ is a
sufficiently smooth function. Randomization smoothes the discontinuous function
$M$: The function $d(s) := E[M(s - q)]$, where $q \in \mathbb{R}^n$ is some random perturbation, is
a continuous function in $s$. If the mean and variance of $q$ are small, then $d \approx M$; if
the variance of $q$ is large, then $d(s_{<t}) \approx d(s_{1:t})$. An intermediate variance makes the
last two terms of (2) simultaneously small enough, leading to excellent bounds for
FPL.
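Identity (2) holds for any predictor $d$ and can be verified numerically. The sketch below uses the unperturbed leader $d = M$ on random losses as an illustrative choice; the helper names are hypothetical. It relies on the convention that the empty sum $s_{<1}$ is the zero vector.

```python
import numpy as np

def leader(s):
    """d = M over unit vectors: one-hot vector of the smallest component."""
    d = np.zeros_like(s)
    d[int(np.argmin(s))] = 1.0
    return d

def identity_terms(losses, d):
    """Return (lhs, beh, ifpl_minus_beh, fpl_minus_ifpl) from identity (2)."""
    T, n = losses.shape
    prefix = np.vstack([np.zeros(n), np.cumsum(losses, axis=0)])  # prefix[t] = s_{1:t}
    steps = range(1, T + 1)
    lhs = sum(d(prefix[t - 1]) @ losses[t - 1] for t in steps)
    beh = d(prefix[T]) @ prefix[T]
    second = sum((d(prefix[t - 1]) - d(prefix[t])) @ prefix[t - 1] for t in steps)
    third = sum((d(prefix[t - 1]) - d(prefix[t])) @ losses[t - 1] for t in steps)
    return lhs, beh, second, third

rng = np.random.default_rng(1)
losses = rng.uniform(size=(30, 5))
lhs, beh, second, third = identity_terms(losses, leader)
```

With $d = M$ the "IFPL-BEH" term is non-positive, as stated in the text, while the "FPL-IFPL" term can be large because $M$ is discontinuous.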

List of notation.

$n \in \mathbb{N}\cup\{\infty\}$ ($n = \infty$ means countably infinite $\mathcal{E}$).

$x^i$ is $i$th component of vector $x \in \mathbb{R}^n$.

$\mathcal{E} := \{e_i : 1 \le i \le n\}$ = set of unit vectors ($e_i^j = \delta_{ij}$).

$\Delta := \{v \in [0,1]^n : \sum_i v^i = 1\}$ = simplex.

$s_t \in [0,1]^n$ = environmental state/loss vector at time $t$.

$s_{1:t} := s_1 + \ldots + s_t$ = state/loss (similar for $\ell_t$ and $r_t$).

$s_{1:T}^{min} = \min_i\{s_{1:T}^i\}$ = loss of Best Expert in Hindsight (BEH).

$s_{<t} := s_1 + \ldots + s_{t-1}$ = state/loss summary ($s_{<0} = 0$).

$M(s) := \arg\min_{d\in\mathcal{D}}\{d \circ s\}$ = best decision on $s$.

$T \in \mathbb{N}$ = total time step, $t \in \mathbb{N}$ = current time step.

$k^i \ge 0$ = penalization = complexity of expert $i$.

$q \in \mathbb{R}^n$ = random vector with independent exponentially distributed components.

$I_t := \arg\min_{i\in\mathcal{E}}\{s_{<t}^i + \frac{k^i - q^i}{\eta_t}\}$ = randomized prediction of FPL.

$\ell_t := E[M(s_{<t} + \frac{k-q}{\eta_t}) \circ s_t]$ = expected loss at time $t$ of FPL ($= E[s_t^{I_t}]$ for $\mathcal{D} = \mathcal{E}$).

$r_t := E[M(s_{1:t} + \frac{k-q}{\eta_t}) \circ s_t]$ = expected loss at time $t$ of IFPL.

$u_t := M(s_{<t} + \frac{k-q}{\eta_t}) \circ s_t$ = actual loss at time $t$ of FPL ($= s_t^{I_t}$ for $\mathcal{D} = \mathcal{E}$).



3 IFPL bounded by Best Expert in Hindsight 

In this section we provide tools for comparing the loss of IFPL to the loss of the 
best expert in hindsight. The first result bounds the expected error induced by the 
exponentially distributed perturbation. 

Lemma 1 (Maximum of Shifted Exponential Distributions) Let $q^1,\ldots,q^n$ be
(not necessarily independent) exponentially distributed random variables, i.e. $P[q^i] =
e^{-q^i}$ for $q^i \ge 0$ and $1 \le i \le n \le \infty$, and $k^i \in \mathbb{R}$ be real numbers with $u := \sum_{i=1}^n e^{-k^i}$.
Then

    $$P[\max_i\{q^i - k^i\} \ge a] = 1 - \prod_{i=1}^n \max\{0, 1 - e^{-a-k^i}\} \quad \text{if } q^1,\ldots,q^n \text{ are independent},$$
    $$P[\max_i\{q^i - k^i\} \ge a] \le \min\{1, u\,e^{-a}\},$$
    $$E[\max_i\{q^i - k^i\}] \le 1 + \ln u.$$

Proof. Using

    $$P[q^i < a] = \max\{0, 1 - e^{-a}\} \ge 1 - e^{-a} \quad\text{and}\quad P[q^i \ge a] = \min\{1, e^{-a}\} \le e^{-a},$$

valid for any $a \in \mathbb{R}$, the exact expression for $P[\max]$ in Lemma 1 follows from

    $$P[\max_i\{q^i - k^i\} < a] = P[q^i - k^i < a\ \forall i] = \prod_{i=1}^n P[q^i < a + k^i] = \prod_{i=1}^n \max\{0, 1 - e^{-a-k^i}\},$$

where the second equality follows from the independence of the $q^i$. The bound on
$P[\max]$ for any $a \in \mathbb{R}$ (including negative $a$) follows from

    $$P[\max_i\{q^i - k^i\} \ge a] = P[\exists i : q^i - k^i \ge a] \le \sum_{i=1}^n P[q^i - k^i \ge a] \le \sum_{i=1}^n e^{-a-k^i} = u \cdot e^{-a},$$

where the first inequality is the union bound. Using $E[z] \le E[\max\{0,z\}] =
\int_0^\infty P[\max\{0,z\} \ge y]\,dy = \int_0^\infty P[z \ge y]\,dy$ (valid for any real-valued random variable
$z$) for $z = \max_i\{q^i - k^i\} - \ln u$, this implies

    $$E[\max_i\{q^i - k^i\} - \ln u] \le \int_0^\infty P[\max_i\{q^i - k^i\} \ge y + \ln u]\,dy \le \int_0^\infty e^{-y}\,dy = 1,$$

which proves the bound on $E[\max]$. □
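The two inequalities of Lemma 1 can be sanity-checked by simulation. The sketch below draws independent perturbations and uses the first 20 complexities $k^i = \frac{1}{2} + 2\ln i$; the sample size, the number of experts, and the threshold $a = 1$ are illustrative assumptions.

```python
import numpy as np

def max_shifted_exponential(k, num_samples, rng):
    """Sample max_i {q^i - k^i} with independent standard exponential q^i."""
    q = rng.exponential(scale=1.0, size=(num_samples, len(k)))
    return (q - k).max(axis=1)

rng = np.random.default_rng(2)
k = 0.5 + 2.0 * np.log(np.arange(1, 21))      # first 20 complexities k^i = 1/2 + 2 ln i
u = np.exp(-k).sum()
samples = max_shifted_exponential(k, 100_000, rng)

empirical_mean = samples.mean()
mean_bound = 1.0 + np.log(u)                  # Lemma 1: E[max] <= 1 + ln u
a = 1.0
empirical_tail = (samples >= a).mean()
tail_bound = min(1.0, u * np.exp(-a))         # Lemma 1: P[max >= a] <= min{1, u e^{-a}}
```

The empirical mean and tail frequency should fall below the respective bounds, with the gap reflecting the slack of the union bound.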

If $n$ is finite, a lower bound $E[\max_i q^i] \ge 0.57721 + \ln n$ can be derived, showing that
the upper bound on $E[\max]$ is quite tight (at least) for $k^i = 0\ \forall i$. The following bound
generalizes [KV03, Lem.3] to arbitrary weights, establishing a relation between IFPL
and the best expert in hindsight.

Theorem 2 (IFPL bounded by BEH) Let $\mathcal{D} \subseteq \mathbb{R}^n$, $s_t \in \mathbb{R}^n$ for $1 \le t \le T$ (both $\mathcal{D}$
and $s$ may even have negative components, but we assume that all required extrema
are attained), and $q, k \in \mathbb{R}^n$. If $\eta_t > 0$ is decreasing in $t$, then the loss of the infeasible
FPL knowing $s_t$ at time $t$ in advance (l.h.s.) can be bounded in terms of the best
predictor in hindsight (first term on r.h.s.) plus additive corrections:

    $$\sum_{t=1}^T M(s_{1:t} + \tfrac{k-q}{\eta_t}) \circ s_t \le \min_{d\in\mathcal{D}}\{d \circ (s_{1:T} + \tfrac{k}{\eta_T})\} + \frac{1}{\eta_T}\max_{d\in\mathcal{D}}\{d \circ (q - k)\} - \frac{1}{\eta_T} M(s_{1:T} + \tfrac{k}{\eta_T}) \circ q.$$

Note that if $\mathcal{D} = \mathcal{E}$ (or $\mathcal{D} = \Delta$) and $s_t \ge 0$, then all extrema in the theorem are
attained almost surely. The same holds for all subsequent extrema in the proof and
throughout the paper.

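Proof. For notational convenience, let $\eta_0 = \infty$ and $\tilde s_{1:t} = s_{1:t} + \frac{k-q}{\eta_t}$. Consider the
losses $\tilde s_t = s_t + (k - q)(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}})$ for the moment. We first show by induction on $T$
that the infeasible predictor $M(\tilde s_{1:t})$ has zero regret for any loss $\tilde s$, i.e.

    $$\sum_{t=1}^T M(\tilde s_{1:t}) \circ \tilde s_t \le M(\tilde s_{1:T}) \circ \tilde s_{1:T}. \qquad (3)$$

For $T = 1$ this is obvious. For the induction step from $T - 1$ to $T$ we need to show

    $$M(\tilde s_{1:T}) \circ \tilde s_T \le M(\tilde s_{1:T}) \circ \tilde s_{1:T} - M(\tilde s_{<T}) \circ \tilde s_{<T}. \qquad (4)$$

This follows from $\tilde s_{1:T} = \tilde s_{<T} + \tilde s_T$ and $M(\tilde s_{1:T}) \circ \tilde s_{<T} \ge M(\tilde s_{<T}) \circ \tilde s_{<T}$ by minimality of
$M$. Rearranging terms in (3), we obtain

    $$\sum_{t=1}^T M(\tilde s_{1:t}) \circ s_t \le M(\tilde s_{1:T}) \circ \tilde s_{1:T} - \sum_{t=1}^T M(\tilde s_{1:t}) \circ (k - q)\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big). \qquad (5)$$

Moreover, by minimality of $M$,

    $$M(\tilde s_{1:T}) \circ \tilde s_{1:T} \le M(s_{1:T} + \tfrac{k}{\eta_T}) \circ (s_{1:T} + \tfrac{k-q}{\eta_T}) \qquad (6)$$
    $$= \min_{d\in\mathcal{D}}\{d \circ (s_{1:T} + \tfrac{k}{\eta_T})\} - M(s_{1:T} + \tfrac{k}{\eta_T}) \circ \frac{q}{\eta_T}$$

holds. Using $\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \ge 0$ and again minimality of $M$, we have

    $$\sum_{t=1}^T \Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big) M(\tilde s_{1:t}) \circ (q - k) \le \sum_{t=1}^T \Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big) M(k - q) \circ (q - k) \qquad (7)$$
    $$= \frac{1}{\eta_T} M(k - q) \circ (q - k) = \frac{1}{\eta_T}\max_{d\in\mathcal{D}}\{d \circ (q - k)\}.$$

Inserting (6) and (7) back into (5) we obtain the assertion. □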



Assuming $q$ random with $E[q^i] = 1$ and taking the expectation in Theorem 2,
the last term reduces to $-\frac{1}{\eta_T}\sum_{i=1}^n M(s_{1:T} + \frac{k}{\eta_T})^i$. If $\mathcal{D} \ge 0$, the term is negative and
may be dropped. In case of $\mathcal{D} = \mathcal{E}$ or $\Delta$, the last term is identical to $-\frac{1}{\eta_T}$ (since
$\sum_i d^i = 1$) and keeping it improves the bound. Furthermore, we need to evaluate the
expectation of the second to last term in Theorem 2, namely $E[\max_{d\in\mathcal{D}}\{d \circ (q - k)\}]$.
For $\mathcal{D} = \mathcal{E}$ and $q$ being exponentially distributed, using Lemma 1, the expectation is
bounded by $1 + \ln u$. We hence get the following bound:



Corollary 3 (IFPL bounded by BEH) For $\mathcal{D} = \mathcal{E}$ and $\sum_i e^{-k^i} \le 1$ and $P[q^i] =
e^{-q^i}$ for $q \ge 0$ and decreasing $\eta_t > 0$, the expected loss of the infeasible FPL exceeds
the loss of expert $i$ by at most $k^i/\eta_T$:

    $$r_{1:T} \le s_{1:T}^i + \frac{1}{\eta_T} k^i \quad \forall i.$$

Theorem 2 can be generalized to expert dependent factorizable $\eta_t \leadsto \eta_t^i = \eta_t \cdot \eta^i$ by
scaling $k^i \leadsto k^i/\eta^i$ and $q^i \leadsto q^i/\eta^i$. Using $E[\max_i\{\frac{q^i - k^i}{\eta^i}\}] \le E[\max_i\{q^i - k^i\}]/\min_i\{\eta^i\}$,
Corollary 3 generalizes to

t=i 'It 'It 'It 



where $\eta_T^{min} := \min_i\{\eta_T^i\}$. For example, for $\eta_t^i = \sqrt{k^i/t}$ we get the desired bound
$s_{1:T}^i + \sqrt{T \cdot (k^i + 4)}$. Unfortunately we were not able to generalize Theorem 4 to
expert-dependent $\eta$, necessary for the final bound on FPL. In Section 6 we solve this problem
by a hierarchy of experts.



4 Feasible FPL bounded by Infeasible FPL 

This section establishes the relation between the FPL and IFPL losses. Recall that
$\ell_t = E[M(s_{<t} + \frac{k-q}{\eta_t}) \circ s_t]$ is the expected loss of FPL at time $t$ and
$r_t = E[M(s_{1:t} + \frac{k-q}{\eta_t}) \circ s_t]$ is the expected loss of IFPL at time $t$.

Theorem 4 (FPL bounded by IFPL) For $\mathcal{D} = \mathcal{E}$ and $0 \le s_t^i \le 1\ \forall i$ and arbitrary
$s_{<t}$ and $P[q] = e^{-\sum_i q^i}$ for $q \ge 0$, the expected loss of the feasible FPL is at most a
factor $e^{\eta_t} > 1$ larger than for the infeasible FPL:

    $$\ell_t \le e^{\eta_t} r_t, \quad\text{which implies}\quad \ell_{1:T} - r_{1:T} \le \sum_{t=1}^T \eta_t \ell_t.$$

Furthermore, if $\eta_t \le 1$, then also $\ell_t \le (1 + \eta_t + \eta_t^2) r_t \le (1 + 2\eta_t) r_t$.



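Proof. Let $s = s_{<t} + \frac{1}{\eta}k$ be the past cumulative penalized state vector, $q$ be a vector
of independent exponential distributions, i.e. $P[q^i] = e^{-q^i}$, and $\eta = \eta_t$. Then

    $$\frac{P[q^j \ge \eta(m+1)]}{P[q^j \ge \eta m]} = \begin{cases} e^{-\eta} & \text{if } m \ge 0 \\ e^{-\eta(m+1)} & \text{if } -1 \le m \le 0 \\ 1 & \text{if } m \le -1 \end{cases} \ \ge\ e^{-\eta}.$$

We now define the random variables $I := \arg\min_i\{s^i - \frac{1}{\eta}q^i\}$ and $J := \arg\min_i\{s^i + s_t^i -
\frac{1}{\eta}q^i\}$, where $0 \le s_t^i \le 1\ \forall i$. Furthermore, for fixed vector $x \in \mathbb{R}^n$ and fixed $j$ we define
$m := \min_{i\ne j}\{s^i - \frac{1}{\eta}x^i\} \le \min_{i\ne j}\{s^i + s_t^i - \frac{1}{\eta}x^i\} =: m'$. With this notation and using the
independence of $q^j$ from $q^i$ for all $i \ne j$, we get

    $$P[I = j \mid q^i = x^i\ \forall i \ne j] = P[s^j - \tfrac{1}{\eta}q^j \le m \mid q^i = x^i\ \forall i \ne j] = P[q^j \ge \eta(s^j - m)]$$
    $$\le e^{\eta} P[q^j \ge \eta(s^j - m + 1)] \le e^{\eta} P[q^j \ge \eta(s^j + s_t^j - m')]$$
    $$= e^{\eta} P[s^j + s_t^j - \tfrac{1}{\eta}q^j \le m' \mid q^i = x^i\ \forall i \ne j] = e^{\eta} P[J = j \mid q^i = x^i\ \forall i \ne j].$$

Since this bound holds under any condition $x$, it also holds unconditionally, i.e.
$P[I = j] \le e^{\eta} P[J = j]$. For $\mathcal{D} = \mathcal{E}$ we have $s_t^I = M(s_{<t} + \frac{k-q}{\eta}) \circ s_t$ and $s_t^J = M(s_{1:t} + \frac{k-q}{\eta}) \circ s_t$,
which implies

    $$\ell_t = E[s_t^I] = \sum_{j=1}^n s_t^j \cdot P[I = j] \le e^{\eta} \sum_{j=1}^n s_t^j \cdot P[J = j] = e^{\eta} E[s_t^J] = e^{\eta} r_t.$$

Finally, $\ell_t - r_t \le \eta_t \ell_t$ follows from $r_t \ge e^{-\eta_t}\ell_t \ge (1 - \eta_t)\ell_t$, and $\ell_t \le e^{\eta_t} r_t \le (1 + \eta_t + \eta_t^2) r_t \le
(1 + 2\eta_t) r_t$ for $\eta_t \le 1$ is elementary. □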
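The ratio of survival probabilities that drives this proof can be checked on a grid. A small sketch follows; the grid of values for $\eta$ and $m$ is an arbitrary illustrative choice, and `survival` and `ratio` are hypothetical helper names.

```python
import math

def survival(a):
    """P[q >= a] for a standard exponential q, valid for every real a."""
    return min(1.0, math.exp(-a))

def ratio(eta, m):
    """P[q >= eta (m + 1)] / P[q >= eta m], claimed to be at least exp(-eta)."""
    return survival(eta * (m + 1.0)) / survival(eta * m)

etas = [0.05, 0.5, 1.0, 3.0]
ms = [x / 10.0 for x in range(-30, 31)]       # covers m <= -1, -1 <= m <= 0 and m >= 0
worst_gap = min(ratio(eta, m) - math.exp(-eta) for eta in etas for m in ms)
```

The lower bound is attained exactly for $m \ge 0$, so `worst_gap` is zero up to rounding.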



Remark. As done by [KV03], one can prove a similar statement for general decision
space $\mathcal{D}$ as long as $\sum_i |s_t^i| \le A$ is guaranteed for some $A > 0$: In this case, we have
$\ell_t \le e^{\eta_t A} r_t$. If $n$ is finite, then the bound holds for $A = n$. For $n = \infty$, the assertion
holds under the somewhat unnatural assumption that $\mathcal{S}$ is $l^1$-bounded.
5 Combination of Bounds and Choices for $\eta_t$

Throughout this section, we assume 

    $$\mathcal{D} = \mathcal{E},\quad s_t \in [0,1]^n\ \forall t,\quad P[q] = e^{-\sum_i q^i} \text{ for } q \ge 0,\quad\text{and}\quad \sum_i e^{-k^i} \le 1. \qquad (8)$$

We distinguish static and dynamic bounds. Static bounds refer to a constant $\eta_t = \eta$.
Since this value has to be chosen in advance, a static choice of $\eta_t$ requires certain
prior information and therefore is not practical in many cases. However, the static
bounds are very easy to derive, and they provide a good means to compare different
PEA algorithms. If on the other hand the algorithm shall be applied without
appropriate prior knowledge, a dynamic choice of $\eta_t$ depending only on $t$ and/or past
observations, is necessary.

Theorem 5 (FPL bound for static $\eta_t = \eta \propto 1/\sqrt{L}$) Assume (8) holds, then the
expected loss $\ell_t$ of feasible FPL, which employs the prediction of the expert $i$
minimizing $s_{<t}^i + \frac{k^i - q^i}{\eta_t}$, is bounded by the loss of the best expert in hindsight in the following
way:

i) For $\eta_t = \eta = 1/\sqrt{L}$ with $L \ge \ell_{1:T}$ we have
    $$\ell_{1:T} \le s_{1:T}^i + \sqrt{L}(k^i + 1) \quad \forall i$$

ii) For $\eta_t = \sqrt{K/L}$ with $L \ge \ell_{1:T}$ and $k^i \le K\ \forall i$ we have
    $$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{LK} \quad \forall i$$

iii) For $\eta_t = \sqrt{k^i/L}$ with $L \ge \max\{s_{1:T}^i, k^i\}$ we have
    $$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{Lk^i} + 3k^i$$

Note that according to assertion (iii), knowledge of only the ratio of the
complexity and the loss of the best expert is sufficient in order to obtain good static
bounds, even for non-uniform complexities.

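Proof. For $\eta_t = \sqrt{K/L}$ and $L \ge \ell_{1:T}$, from Theorem 4 and Corollary 3, we get

    $$\ell_{1:T} - r_{1:T} \le \sum_{t=1}^T \eta_t \ell_t = \ell_{1:T}\sqrt{K/L} \le \sqrt{LK} \quad\text{and}\quad r_{1:T} - s_{1:T}^i \le k^i/\eta_T = k^i\sqrt{L/K}.$$

Combining both, we get $\ell_{1:T} - s_{1:T}^i \le \sqrt{L}(\sqrt{K} + k^i/\sqrt{K})$. (i) follows from $K = 1$ and
(ii) from $k^i \le K$.

(iii) For $\eta = \sqrt{k^i/L} \le 1$ we get

    $$\ell_{1:T} \le e^{\eta} r_{1:T} \le (1 + \eta + \eta^2) r_{1:T} \le \Big(1 + \sqrt{\tfrac{k^i}{L}} + \tfrac{k^i}{L}\Big)\Big(s_{1:T}^i + \sqrt{\tfrac{L}{k^i}}\,k^i\Big)$$
    $$\le s_{1:T}^i + \sqrt{Lk^i} + \Big(\sqrt{\tfrac{k^i}{L}} + \tfrac{k^i}{L}\Big)(L + \sqrt{Lk^i}) = s_{1:T}^i + 2\sqrt{Lk^i} + \Big(2 + \sqrt{\tfrac{k^i}{L}}\Big)k^i.$$
□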



The static bounds require knowledge of an upper bound $L$ on the loss (or the
ratio of the complexity of the best expert and its loss). Since the instantaneous
loss is bounded by 1, one may set $L = T$ if $T$ is known in advance. For finite $n$
and $k^i = K = \ln n$, bound (ii) gives the classic regret $\propto \sqrt{T \ln n}$. If neither $T$ nor $L$
is known, a dynamic choice of $\eta_t$ is necessary. We first present bounds with regret
$\propto \sqrt{T}$, thereafter with regret $\propto \sqrt{s_{1:T}^i}$.

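Theorem 6 (FPL bound for dynamic $\eta_t \propto 1/\sqrt{t}$) Assume (8) holds.

i) For $\eta_t = 1/\sqrt{t}$ we have $\ell_{1:T} \le s_{1:T}^i + \sqrt{T}(k^i + 2)\ \forall i$
ii) For $\eta_t = \sqrt{K/2t}$ and $k^i \le K\ \forall i$ we have $\ell_{1:T} \le s_{1:T}^i + 2\sqrt{2TK}\ \forall i$

Proof. For $\eta_t = \sqrt{K/2t}$, using $\sum_{t=1}^T \frac{1}{\sqrt{t}} \le \int_0^T \frac{dt}{\sqrt{t}} = 2\sqrt{T}$ and $\ell_t \le 1$ we get

    $$\ell_{1:T} - r_{1:T} \le \sum_{t=1}^T \eta_t \le \sqrt{2TK} \quad\text{and}\quad r_{1:T} - s_{1:T}^i \le k^i/\eta_T = k^i\sqrt{\tfrac{2T}{K}}.$$

Combining both, we get $\ell_{1:T} - s_{1:T}^i \le \sqrt{2T}(\sqrt{K} + k^i/\sqrt{K})$. (i) follows from $K = 2$
and (ii) from $k^i \le K$. □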
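As a rough empirical illustration of Theorem 6 (ii), the sketch below averages the FPL loss over repeated draws of the perturbation on one fixed loss sequence and compares the resulting regret with $2\sqrt{2TK}$. The random loss sequence, the number of runs, and the function name `fpl_total_loss` are assumptions of this sketch; a simulation on a benign sequence says nothing about tightness of the bound.

```python
import numpy as np

def fpl_total_loss(losses, k, K, rng):
    """One FPL run with eta_t = sqrt(K / (2 t)) and a single perturbation q."""
    T, n = losses.shape
    q = rng.exponential(size=n)
    cumulative = np.zeros(n)
    total = 0.0
    for t in range(1, T + 1):
        eta_t = np.sqrt(K / (2.0 * t))
        i_t = int(np.argmin(cumulative + (k - q) / eta_t))
        total += losses[t - 1, i_t]
        cumulative += losses[t - 1]
    return total

rng = np.random.default_rng(3)
n, T, runs = 8, 200, 300
K = np.log(n)
k = np.full(n, K)                              # uniform complexities, sum_i exp(-k^i) = 1
losses = rng.uniform(size=(T, n))              # one fixed (oblivious) loss sequence
average_loss = np.mean([fpl_total_loss(losses, k, K, rng) for _ in range(runs)])
best_expert_loss = losses.sum(axis=0).min()
regret_bound = 2.0 * np.sqrt(2.0 * T * K)      # Theorem 6 (ii)
```

The averaged loss approximates $\ell_{1:T}$, so `average_loss - best_expert_loss` is an estimate of the expected regret.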

In Theorem 5 we assumed knowledge of an upper bound $L$ on $\ell_{1:T}$. In an adaptive
form, $L_t := \ell_{<t} + 1$, known at the beginning of time $t$, could be used as an upper bound
on $\ell_{1:t}$ with corresponding adaptive $\eta_t \propto 1/\sqrt{L_t}$. Such choice of $\eta_t$ is also called
self-confident [ACBG02].

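Theorem 7 (FPL bound for self-confident $\eta_t \propto 1/\sqrt{\ell_{<t}}$) Assume (8) holds.

i) For $\eta_t = 1/\sqrt{2(\ell_{<t} + 1)}$ we have

    $$\ell_{1:T} \le s_{1:T}^i + (k^i + 1)\sqrt{2(s_{1:T}^i + 1)} + 2(k^i + 1)^2 \quad \forall i$$

ii) For $\eta_t = \sqrt{K/2(\ell_{<t} + 1)}$ and $k^i \le K\ \forall i$ we have

    $$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{2(s_{1:T}^i + 1)K} + 8K \quad \forall i$$

Proof. Using $\eta_t = \sqrt{K/2(\ell_{<t} + 1)} \le \sqrt{K/2\ell_{1:t}}$ and $\frac{b-a}{\sqrt{b}} = (\sqrt{b} - \sqrt{a})(\sqrt{b} + \sqrt{a})\frac{1}{\sqrt{b}} \le
2(\sqrt{b} - \sqrt{a})$ for $a \le b$ and $t_0 := \min\{t : \ell_{1:t} > 0\}$ we get

    $$\ell_{1:T} - r_{1:T} \le \sum_{t=t_0}^T \eta_t \ell_t \le \sqrt{\tfrac{K}{2}} \sum_{t=t_0}^T \frac{\ell_{1:t} - \ell_{<t}}{\sqrt{\ell_{1:t}}} \le \sqrt{2K} \sum_{t=t_0}^T \big[\sqrt{\ell_{1:t}} - \sqrt{\ell_{<t}}\big] = \sqrt{2K}\sqrt{\ell_{1:T}}.$$

Adding $r_{1:T} - s_{1:T}^i \le \frac{k^i}{\eta_T} \le k^i\sqrt{2(\ell_{1:T} + 1)/K}$ we get

    $$\ell_{1:T} - s_{1:T}^i \le \sqrt{2\bar\kappa^i(\ell_{1:T} + 1)}, \quad\text{where}\quad \sqrt{\bar\kappa^i} := \sqrt{K} + k^i/\sqrt{K}.$$

Taking the square and solving the resulting quadratic inequality w.r.t. $\ell_{1:T}$ we get

    $$\ell_{1:T} \le s_{1:T}^i + \bar\kappa^i + \sqrt{2(s_{1:T}^i + 1)\bar\kappa^i + (\bar\kappa^i)^2} \le s_{1:T}^i + \sqrt{2(s_{1:T}^i + 1)\bar\kappa^i} + 2\bar\kappa^i.$$

For $K = 1$ we get $\sqrt{\bar\kappa^i} = k^i + 1$ which yields (i). For $k^i \le K$ we get $\bar\kappa^i \le 4K$ which
yields (ii). □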
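The step "taking the square and solving the resulting quadratic inequality" can be spelled out as follows. The shorthand $x$, $c$ and $\kappa$ is introduced here for readability and is not part of the original notation; $\kappa$ stands for the quantity $(\sqrt{K} + k^i/\sqrt{K})^2$ from the proof. If $x \le 0$ the conclusion is trivial, so only $x > 0$ needs the argument.

```latex
% Shorthand (assumed here): x := \ell_{1:T} - s^i_{1:T},  c := s^i_{1:T} + 1,
% \kappa := (\sqrt{K} + k^i/\sqrt{K})^2, so that \ell_{1:T} + 1 = x + c.
\begin{align*}
x \le \sqrt{2\kappa\,(x + c)}
  &\;\Longrightarrow\; x^2 - 2\kappa x - 2\kappa c \le 0 \\
  &\;\Longrightarrow\; x \le \kappa + \sqrt{\kappa^2 + 2\kappa c} \\
  &\;\Longrightarrow\; x \le 2\kappa + \sqrt{2\kappa c},
\end{align*}
% where the last step uses \sqrt{a + b} \le \sqrt{a} + \sqrt{b} for a, b \ge 0.
```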

The proofs of results similar to (ii) for WM for 0/1 loss all fill several pages
[ACBG02, YEYS04]. The next result establishes a similar bound, but instead of
using the expected value $\ell_{<t}$, the best loss so far $s_{<t}^{min}$ is used. This may have
computational advantages, since $s_{<t}^{min}$ is immediately available, while $\ell_{<t}$ needs to
be evaluated (see discussion in Section 9).



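Theorem 8 (FPL bound for adaptive $\eta_t \propto 1/\sqrt{s_{<t}^{min}}$) Assume (8) holds.

i) For $\eta_t = 1/\min_i\{k^i + \sqrt{(k^i)^2 + 2s_{<t}^i + 2}\}$ we have

    $$\ell_{1:T} \le s_{1:T}^i + (k^i + 2)\sqrt{2s_{1:T}^i} + 2(k^i + 2)^2 \quad \forall i$$

ii) For $\eta_t = \sqrt{\tfrac{1}{2}} \cdot \min\{1, \sqrt{K/s_{<t}^{min}}\}$ and $k^i \le K\ \forall i$ we have

    $$\ell_{1:T} \le s_{1:T}^i + 2\sqrt{2Ks_{1:T}^i} + 5K\ln(s_{1:T}^i) + 3K + 6 \quad \forall i$$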

We briefly motivate the strange looking choice for $\eta_t$ in (i). The first naive
candidate, $\eta_t \propto 1/\sqrt{s_{<t}^{min}}$, turns out too large. The next natural trial is requesting
$\eta_t = 1/\sqrt{2\min_i\{s_{<t}^i + \frac{k^i}{\eta_t}\}}$. Solving this equation results in $\eta_t = 1/(k^i + \sqrt{(k^i)^2 + 2s_{<t}^i})$,
where $i$ is the index for which $s_{<t}^i + \frac{k^i}{\eta_t}$ is minimal.
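For a single fixed expert, the closed form can be checked against the defining equation $\eta = 1/\sqrt{2(s + k/\eta)}$. The sketch below is illustrative only; the test values of $(k, s)$ are arbitrary and the minimization over experts is left out.

```python
import math

def fixed_point_eta(k, s):
    """Closed-form solution eta = 1 / (k + sqrt(k^2 + 2 s)) for a single expert."""
    return 1.0 / (k + math.sqrt(k * k + 2.0 * s))

def residual(eta, k, s):
    """eta - 1 / sqrt(2 (s + k / eta)); zero at a solution of the defining equation."""
    return eta - 1.0 / math.sqrt(2.0 * (s + k / eta))

cases = [(0.5, 0.0), (1.0, 3.0), (2.5, 40.0), (0.0, 8.0)]
residuals = [residual(fixed_point_eta(k, s), k, s) for k, s in cases]
```

Writing $x = 1/\eta$, the equation becomes $x^2 - 2kx - 2s = 0$, whose positive root is $x = k + \sqrt{k^2 + 2s}$.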

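Proof. Define the minimum of a vector as its minimum component, e.g. $\min(k) =
k^{min}$. For notational convenience, let $\eta_0 = \infty$ and $\tilde s_{1:t} = s_{1:t} + \frac{k-q}{\eta_t}$. Like in the proof
of Theorem 2, we consider one exponentially distributed perturbation $q$. Since
$M(\tilde s_{1:t}) \circ \tilde s_t \le M(\tilde s_{1:t}) \circ \tilde s_{1:t} - M(\tilde s_{<t}) \circ \tilde s_{<t}$ by (4), we have

    $$M(\tilde s_{1:t}) \circ s_t \le M(\tilde s_{1:t}) \circ \tilde s_{1:t} - M(\tilde s_{<t}) \circ \tilde s_{<t} - M(\tilde s_{1:t}) \circ \Big(\frac{k-q}{\eta_t} - \frac{k-q}{\eta_{t-1}}\Big).$$

Since $\eta_t \le \sqrt{1/2}$, Theorem 4 asserts $\ell_t \le E[(1 + \eta_t + \eta_t^2) M(\tilde s_{1:t}) \circ s_t]$, thus $\ell_{1:T} \le A + B$,
where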
where 



A = ^^[(l + 77 t + 77 2 )(M(s 1:t )°5 1:t -M( S ~ <t )°5 <t ) 



T 

E 

*=i 
B = Y.E



E[(l + r] T + 7^)M(5i :T ) °§1:t] - E[{1 + 771 + r/ 2 ) min( )] 

T-l 

+ £ £ [fo - Vt+i + ^ 2 - 77 2 +1 )M(s 1:t ) °s 1:t 
t=\ 

2w/ = \_(<l- k 



and 



(l + ri t + rit)M(h:t)° 



Vt Vt-i 



t=i 

+ 2) f 1 _ J_ U 1 + + v\ + g ^ - m + i + ^ - vl 

ti ' W ^-1/ Vt ti Vt 



Here, the estimate for $B$ follows from $\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \ge 0$ and $E[M(\eta_t s_{1:t} + k - q) \circ (q - k)] \le
E[\max_i\{q^i - k^i\}] \le 1$, which in turn holds by minimality of $M$, $\sum_i e^{-k^i} \le 1$ and
Lemma 1. In order to estimate $A$, we set $\bar s_{1:t} = s_{1:t} + \frac{k}{\eta_t}$ and observe $M(\tilde s_{1:t}) \circ \tilde s_{1:t} \le
M(\bar s_{1:t}) \circ (\bar s_{1:t} - \frac{q}{\eta_t})$ by minimality of $M$. The expectations of $q$ can then be evaluated
to $E[M(\bar s_{1:t}) \circ q] = 1$, and as before we have $E[-\min(k - q)] \le 1$. Hence

ii-.r < A + B < (l+r] T + Vt) ( M(s 1:T ) °s 1:T _ 1_\ + 1 + Vi + vl 

\ VtJ Vi 

+ E (Vt - Vt+i + Vt - Vt+i) [M(s 1:t ) °s 1:t - - ) + B (9) 

t=i V W 

T-l _ x 

< (1 + r] T + min(si :T ) + E (Vt ~ Vt+i + Vt ~ Vt+i) min (si:t) H h 2. 

We now proceed by considering the two parts of the theorem separately. 

(i) Here, $\eta_t = 1/\min(k + \sqrt{k^2 + 2s_{<t} + 2})$. Fix $t \le T$ and choose $m$ such that $k^m +
\sqrt{(k^m)^2 + 2s_{<t}^m + 2}$ is minimal. Then

    $$\min(s_{1:t} + \tfrac{k}{\eta_t}) \le s_{<t}^m + 1 + \frac{k^m}{\eta_t} = \frac{1}{2}\Big(k^m + \sqrt{(k^m)^2 + 2s_{<t}^m + 2}\Big)^2 = \frac{1}{2\eta_t^2} \le \frac{1}{2\eta_t\eta_{t+1}}.$$

We may overestimate the quadratic terms $\eta_t^2$ in (9) by $\eta_t$; the easiest justification is
that we could have started with the cruder estimate $\ell_t \le (1 + 2\eta_t) r_t$ from Theorem 4.
Then

    $$\ell_{1:T} \le (1 + 2\eta_T)\min(s_{1:T} + \tfrac{k}{\eta_T}) + 2\sum_{t=1}^{T-1}(\eta_t - \eta_{t+1})\min(s_{1:t} + \tfrac{k}{\eta_t}) + \frac{1}{\eta_1} + 2$$
    $$\le (1 + 2\eta_T)\frac{1}{2\eta_T^2} + 2\sum_{t=1}^{T-1}(\eta_t - \eta_{t+1})\frac{1}{2\eta_t\eta_{t+1}} + \frac{1}{\eta_1} + 2$$
    $$= \frac{1}{2\eta_T^2} + \frac{1}{\eta_T} + \sum_{t=1}^{T-1}\Big(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\Big) + \frac{1}{\eta_1} + 2$$
    $$= \frac{1}{2}\min(k + \sqrt{k^2 + 2s_{<T} + 2})^2 + 2\min(k + \sqrt{k^2 + 2s_{<T} + 2}) + 2$$
    $$\le s_{1:T}^i + (k^i + 2)\sqrt{2s_{1:T}^i} + 2(k^i)^2 + 6k^i + 6 \quad\text{for all } i.$$

This proves the first part of the theorem.

(ii) Here we have $K \ge k^i$ for all $i$. Abbreviate $a_t = \max\{K, s_{1:t}^{min}\}$ for $1 \le t \le T$,
then 7] t = at > K, and a t — a t -i < 1 for all t. Observe M(s\ :t ) — M(si :t ), 

n n . — y/K{a t -at-i) 2 2 _ K(a t -at-l) ] a t -at-l < 1 (-, , <H-at-i \ _ 

ln(a t )— ln(af_i) which is true for at ~ ffli ~ 1 < ^ < j^- This implies 
(r]t-r]t + i)K K(a t -a t - 1 ) ( a t - a t A 

< ~ <^ln H = K(\n(a t ) - m(a t _ij), 

^{at - at-i) + >/a^ - vCT) 



(ifc-ifc+iKT < 



IK (a* - at-i) 1 

— (Va* - V^T) + 



2 v /2a t _i( v /a^+ ^/a t -i) 



2 



use ttt — &t — 1 ^ 1 
and a^_i>i*C 



/if 1 



{^ t -4 +1 )K = KVK(a t - a^-i) 
^ \/2a tv /a t _i 



Ot_l>K 

< v^A"(ln(at) -ln(a t _i)), and 



la t -\ 

The logarithmic estimate in the second and third bound is unnecessarily rough and 
for convenience only. Therefore, the coefficient of the log-term in the final bound 
of the theorem can be reduced to $2K$ without much effort. Plugging the above
estimates back into (9) yields 



k-.T < + J-sZF + ^2Ks^ + 3K + 2 + J-sZF + (\K + * ) ln(s™ n ) 



+— + 2 < s™" + 2J2KsTt + 5K\n(s™^) + 3K + 6. 
'/: 

This completes the proof. □ 

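Theorem 7 and Theorem 8 (i) immediately imply the following bounds on the
$\sqrt{\text{Loss}}$-regrets: $\sqrt{\ell_{1:T}} \le \sqrt{s^i_{1:T} + 1} + \sqrt{8K}$, $\sqrt{\ell_{1:T}} \le \sqrt{s^i_{1:T} + 1} + \sqrt2\,(k^i + 1)$, and
$\sqrt{\ell_{1:T}} \le \sqrt{s^i_{1:T}} + \sqrt2\,(k^i + 2)$, respectively.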

Remark. The same analysis as for Theorems [5-8] (ii) applies to general $\mathcal D$, using
$\ell_t \le e^{\eta_t n} r_t$ instead of $\ell_t \le e^{\eta_t} r_t$, and leading to an additional factor $\sqrt n$ in the regret.
Compare the remark at the end of Section 4.
6 Hierarchy of Experts



We derived bounds which do not need prior knowledge of $L$ with regret $\propto \sqrt{TK}$
and $\propto \sqrt{s^i_{1:T} K}$ for a finite number of experts with equal penalty $K = k^i = \ln n$. For
an infinite number of experts, unbounded expert-dependent complexity penalties
$k^i$ are necessary (due to constraint $\sum_i e^{-k^i} \le 1$). Bounds for this case (without
prior knowledge of $T$) with regret $\propto k^i\sqrt T$ and $\propto k^i\sqrt{s^i_{1:T}}$ have been derived. In
this case, the complexity $k^i$ is no longer under the square root. Although this
already implies Hannan consistency, i.e. the average per round regret tends to zero
as $T \to \infty$, improved regret bounds $\propto \sqrt{Tk^i}$ and $\propto \sqrt{s^i_{1:T} k^i}$ are desirable and likely
to hold. We were not able to derive such improved bounds for FPL, but for a
(slight) modification. We consider a two-level hierarchy of experts. First consider
an FPL for the subclass of experts of complexity $K$, for each $K \in \mathbb N$. Regard these
FPL$^K$ as (meta) experts and use them to form a (meta) $\widetilde{\text{FPL}}$. The class of meta
experts now contains for each complexity only one (meta) expert, which allows us
to derive good bounds. In the following, quantities referring to complexity class $K$
are superscripted by $K$, and meta quantities are superscripted by $\sim$.

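Consider the class of experts $\mathcal E^K := \{i : K - 1 < k^i \le K\}$ of complexity $K$, for each
$K \in \mathbb N$. FPL$^K$ makes randomized prediction $I_t^K := \arg\min_{i\in\mathcal E^K}\{s^i_{<t} + \frac{k^i - q^i}{\eta_t^K}\}$ with
$\eta_t^K := \sqrt{K/2t}$ and suffers loss $u_t^K := s_t^{I_t^K}$ at time $t$. Since $k^i \le K\ \forall i \in \mathcal E^K$ we can apply
Theorem 6(ii) to FPL$^K$:

$$E[u^K_{1:T}] = \ell^K_{1:T} \le s^i_{1:T} + 2\sqrt{2TK} \quad \forall i \in \mathcal E^K\ \forall K \in \mathbb N. \qquad (10)$$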

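We now define a meta state $\tilde s_t^K = u_t^K$ and regard FPL$^K$ for $K \in \mathbb N$ as meta experts,
so meta expert $K$ suffers loss $\tilde s_t^K$. (Assigning expected loss $\tilde s_t^K = E[u_t^K] = \ell_t^K$ to FPL$^K$
would also work.) Hence the setting is again an expert setting and we define the
meta $\widetilde{\text{FPL}}$ to predict $\tilde I_t := \arg\min_{K\in\mathbb N}\{\tilde s^K_{<t} + \frac{\tilde k^K - \tilde q^K}{\tilde\eta_t}\}$ with $\tilde\eta_t = 1/\sqrt t$ and $\tilde k^K = \frac12 + 2\ln K$
(implying $\sum_{K=1}^\infty e^{-\tilde k^K} \le 1$). Note that $\tilde s^K_{1:t} = \tilde s^K_1 + \ldots + \tilde s^K_t = s_1^{I_1^K} + \ldots + s_t^{I_t^K}$ sums over the
same meta state components $K$, but over different components $I_t^K$ in normal state
representation.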

By Theorem 6(i) the $\tilde q$-expected loss of $\widetilde{\text{FPL}}$ is bounded by $\tilde s^K_{1:T} + \sqrt T(\tilde k^K + 2)$.
As this bound holds for all $q$ it also holds in $q$-expectation. So if we define $\tilde\ell_{1:T}$ to
be the $q$ and $\tilde q$ expected loss of $\widetilde{\text{FPL}}$, and chain this bound with (10) for $i \in \mathcal E^K$ we
get:

$$\begin{aligned}
\tilde\ell_{1:T} &\le E[\tilde s^K_{1:T} + \sqrt T(\tilde k^K + 2)] = \ell^K_{1:T} + \sqrt T(\tilde k^K + 2) \\
&\le s^i_{1:T} + \sqrt T\big[2\sqrt{2(k^i + 1)} + \tfrac12 + 2\ln(k^i + 1) + 2\big],
\end{aligned}$$

where we have used $K \le k^i + 1$. This bound is valid for all $i$ and has the desired
regret $\propto \sqrt{Tk^i}$. Similarly we can derive regret bounds $\propto \sqrt{s^i_{1:T} k^i}$ by exploiting that
the bounds in Theorems 7 and 8 are concave in $s^i_{1:T}$ and using Jensen's inequality.
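
The two-level construction can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are ours, the expert class is finite, only the non-empty complexity classes are kept as meta experts, all complexities are assumed positive with $\sum_i e^{-k^i} \le 1$, and the perturbations are drawn once at time 0.

```python
import math
import random

def hierarchical_fpl(losses, k, rng):
    """Run the two-level FPL once and return its total loss.

    losses[t][i] in [0, 1] is the loss of expert i in round t+1,
    k[i] > 0 is its complexity (assumed: sum_i exp(-k[i]) <= 1).
    """
    n = len(k)
    classes = {}                                   # K -> experts with K-1 < k_i <= K
    for i in range(n):
        classes.setdefault(math.ceil(k[i]), []).append(i)
    q = [rng.expovariate(1.0) for _ in range(n)]   # base perturbations q^i
    q_meta = {K: rng.expovariate(1.0) for K in classes}
    s = [0.0] * n                                  # cumulative expert losses
    s_meta = {K: 0.0 for K in classes}             # cumulative losses of each FPL^K
    total = 0.0
    for t, loss_t in enumerate(losses, start=1):
        pick = {}
        for K, members in classes.items():
            eta_K = math.sqrt(K / (2.0 * t))       # eta_t^K = sqrt(K / 2t)
            pick[K] = min(members, key=lambda i: s[i] + (k[i] - q[i]) / eta_K)
        eta_meta = 1.0 / math.sqrt(t)              # meta learning rate 1 / sqrt(t)
        best_K = min(classes, key=lambda K: s_meta[K]
                     + (0.5 + 2.0 * math.log(K) - q_meta[K]) / eta_meta)
        total += loss_t[pick[best_K]]
        for i in range(n):
            s[i] += loss_t[i]
        for K in classes:
            s_meta[K] += loss_t[pick[K]]           # meta state: loss of FPL^K
    return total

def kraft_sum_meta(k_max):
    # partial sum of exp(-(1/2 + 2 ln K)) = exp(-1/2) / K^2 over K = 1..k_max
    return sum(math.exp(-0.5) / (K * K) for K in range(1, k_max + 1))
```

The helper `kraft_sum_meta` illustrates why the meta complexities $\frac12 + 2\ln K$ are admissible: the full series equals $e^{-1/2}\pi^2/6 \approx 0.998$, just below 1.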
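Theorem 9 (Hierarchical FPL bound for dynamic $\eta_t$) The hierarchical $\widetilde{\text{FPL}}$
employs at time $t$ the prediction of expert $i_t := I_t^{\tilde I_t}$, where

$$I_t^K := \arg\min_{i\in\mathcal E^K}\Big\{s^i_{<t} + \frac{k^i - q^i}{\eta_t^K}\Big\} \quad\text{and}\quad \tilde I_t := \arg\min_{K\in\mathbb N}\Big\{s_1^{I_1^K} + \ldots + s_{t-1}^{I_{t-1}^K} + \frac{\frac12 + 2\ln K - \tilde q^K}{\tilde\eta_t}\Big\}.$$

Under assumptions (8) and independent $P[\tilde q^K] = e^{-\tilde q^K}\ \forall K \in \mathbb N$, the expected loss
$\tilde\ell_{1:T} = E[s_1^{i_1} + \ldots + s_T^{i_T}]$ of $\widetilde{\text{FPL}}$ is bounded as follows: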



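a) For $\eta_t^K = \sqrt{K/2t}$ and $\tilde\eta_t = 1/\sqrt t$ we have
$$\tilde\ell_{1:T} \le s^i_{1:T} + 2\sqrt{2Tk^i}\cdot\Big(1 + O\Big(\frac{\ln k^i}{\sqrt{k^i}}\Big)\Big) \quad \forall i.$$

b) For $\tilde\eta_t$ as in (i) and $\eta_t^K$ as in (ii) of Theorem $\{7, 8\}$ we have
$$\tilde\ell_{1:T} \le s^i_{1:T} + 2\sqrt{2s^i_{1:T}k^i}\cdot\Big(1 + O\Big(\frac{\ln k^i}{\sqrt{k^i}}\Big)\Big) + \big\{O(k^i),\ O(k^i\ln s^i_{1:T})\big\} \quad \forall i.$$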

The hierarchical $\widetilde{\text{FPL}}$ differs from a direct FPL over all experts $\mathcal E$. One potential way
to prove a bound on direct FPL may be to show (if it holds) that FPL performs better
than $\widetilde{\text{FPL}}$, i.e. $\ell_{1:T} \le \tilde\ell_{1:T}$. Another way may be to suitably generalize Theorem 4 to
expert dependent $\eta$.

7 Lower Bound on FPL 

A lower bound on FPL similar to the upper bound in Theorem 2 can also be proven. 

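Theorem 10 (FPL lower-bounded by BEH) Let $n$ be finite. Assume $\mathcal D \subseteq \mathbb R^n$
and $s_t \in \mathbb R^n$ are chosen such that the required extrema exist (possibly negative),
$q \in \mathbb R^n$, and $\eta_t > 0$ is a decreasing sequence. Then the loss of FPL for uniform
complexities (l.h.s.) can be lower-bounded in terms of the best predictor in hindsight
(first term on r.h.s.) plus/minus additive corrections:

$$\sum_{t=1}^T M\Big(s_{<t} - \frac{q}{\eta_t}\Big)\circ s_t \ge \min_{d\in\mathcal D}\{d\circ s_{1:T}\} - \frac{1}{\eta_T}\max_{d\in\mathcal D}\{d\circ q\} + \sum_{t=1}^T\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big)M(s_{<t})\circ q$$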

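Proof. For notational convenience, let $\eta_0 = \infty$ and $\tilde s_{1:t} = s_{1:t} - \frac{q}{\eta_t}$. Consider the
losses $\tilde s_t = s_t - q\big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\big)$ for the moment. We first show by induction on $T$ that the
predictor $M(\tilde s_{<t})$ has nonnegative regret, i.e.

$$\sum_{t=1}^T M(\tilde s_{<t})\circ\tilde s_t \ge M(\tilde s_{1:T})\circ\tilde s_{1:T}. \qquad (11)$$

For $T = 1$ this follows immediately from minimality of $M$ ($\tilde s_{<1} := 0$). For the induction
step from $T - 1$ to $T$ we need to show

$$M(\tilde s_{<T})\circ\tilde s_T \ge M(\tilde s_{1:T})\circ\tilde s_{1:T} - M(\tilde s_{<T})\circ\tilde s_{<T}.$$

Due to $\tilde s_{1:T} = \tilde s_{<T} + \tilde s_T$, this is equivalent to $M(\tilde s_{<T})\circ\tilde s_{1:T} \ge M(\tilde s_{1:T})\circ\tilde s_{1:T}$, which holds
by minimality of $M$. Rearranging terms in (11) we obtain

$$\sum_{t=1}^T M(\tilde s_{<t})\circ s_t \ge M(\tilde s_{1:T})\circ\tilde s_{1:T} + \sum_{t=1}^T M(\tilde s_{<t})\circ q\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big), \quad\text{with} \qquad (12)$$

$$M(\tilde s_{1:T})\circ\tilde s_{1:T} = M(\tilde s_{1:T})\circ s_{1:T} - M(\tilde s_{1:T})\circ\frac{q}{\eta_T} \ge \min_{d\in\mathcal D}\{d\circ s_{1:T}\} - \frac{1}{\eta_T}\max_{d\in\mathcal D}\{d\circ q\}$$

$$\text{and}\quad \sum_{t=1}^T M(\tilde s_{<t})\circ q\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big) \ge \sum_{t=1}^T\Big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\Big)M(s_{<t})\circ q.$$

Again, the last bound follows from the minimality of $M$, which asserts that
$[M(s-q) - M(s)]\circ s \ge 0 \ge [M(s-q) - M(s)]\circ(s-q)$ and thus implies that
$M(s-q)\circ q \ge M(s)\circ q$. So Theorem 10 follows from (12). □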

Assuming $q$ random with $E[q^i] = 1$ and taking the expectation in Theorem 10,
the last term reduces to $\sum_t\big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\big)\sum_i M(s_{<t})^i$. If $\mathcal D \ge 0$, the term is positive and
may be dropped. In case of $\mathcal D = \mathcal E$ or $\Delta$, the last term is identical to $\frac{1}{\eta_T}$ (since
$\sum_i d^i = 1$) and keeping it improves the bound. Furthermore, we need to evaluate
the expectation of the second to last term in Theorem 10, namely $E[\max_{d\in\mathcal D}\{d\circ q\}]$.
For $\mathcal D = \mathcal E$ and $q$ being exponentially distributed, using Lemma 1 with $k^i = 0\ \forall i$, the
expectation is bounded by $1 + \ln n$. We hence get the following lower bound:

Corollary 11 (FPL lower-bounded by BEH) For $\mathcal D = \mathcal E$ and any $\mathcal S$ and all $k^i$
equal and $P[q^i] = e^{-q^i}$ for $q \ge 0$ and decreasing $\eta_t > 0$, the expected loss of FPL is at
most $\ln n/\eta_T$ lower than the loss of the best expert in hindsight:

$$\ell_{1:T} \ge s^{\min}_{1:T} - \frac{\ln n}{\eta_T}$$

The upper and lower bounds on $\ell_{1:T}$ (Theorem 4 and Corollaries 3 and 11)
together show that

$$\frac{\ell_{1:t}}{s^{\min}_{1:t}} \to 1 \quad\text{if}\quad \eta_t \to 0 \ \text{ and }\ \eta_t\cdot s^{\min}_{1:t} \to \infty \ \text{ and }\ k^i = K\ \forall i. \qquad (13)$$

For instance, $\eta_t = \sqrt{K/2s^{\min}_{<t}}$. For $\eta_t = \sqrt{K/2(\ell_{<t} + 1)}$ we proved the bound in Theorem
7(ii). Knowing that $\sqrt{K/2(\ell_{<t} + 1)}$ converges to $\sqrt{K/2s^{\min}_{<t}}$ due to (13), we can
derive a bound similar to Theorem 7(ii) for $\eta_t = \sqrt{K/2s^{\min}_{<t}}$. This choice for $\eta_t$ has
the advantage that we do not have to compute $\ell_{<t}$ (cf. Section 9), as also achieved
by Theorem 8(ii).

We do not know whether Theorem 10 can be generalized to expert dependent 
complexities k l . 
8 Adaptive Adversary



In this section we show that bounds that hold against an oblivious adversary auto- 
matically also hold against an adaptive one. 

Initial versus independent randomization. So far we assumed that the perturbations
$q$ are sampled only once at time $t = 0$. As already indicated, under the
expectation this is equivalent to generating a new perturbation q t at each time step 
t, i.e. Theorems 4-9 remain valid for this case. While the former choice was favor- 
able for the analysis, the latter has two advantages. First, repeated sampling of the 
perturbations guarantees better bounds with high probability (see next section). 
Second, if the losses are generated by an adaptive adversary (not to be confused 
with an adaptive learning rate) which has access to FPL's past decisions, then he 
may after some time figure out the initial random perturbation and use it to force 
FPL to have a large loss. We now show that the bounds for FPL remain valid, even 
in case of an adaptive adversary, if independent randomization $q \leadsto q_t$ is used.
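
The two sampling schemes differ only in when the perturbation is drawn. The following Python sketch makes that explicit. It is a hedged illustration: the function name, its signature, and the flag `resample` are our own, and the loss sequence is fixed in advance (oblivious), so it does not model an adaptive adversary.

```python
import math
import random

def fpl_total_loss(losses, k, eta, rng, resample):
    """Total loss of FPL on a fixed loss sequence.

    losses[t][i] in [0, 1]; k[i] is the complexity of expert i; eta(t) > 0 is
    the learning rate for round t = 1, 2, ...  With resample=False a single
    perturbation vector q is drawn at time 0 and reused; with resample=True a
    fresh independent q_t is drawn in every round.
    """
    n = len(k)

    def draw():
        # independent exponentially distributed perturbations, one per expert
        return [rng.expovariate(1.0) for _ in range(n)]

    q = draw()
    s = [0.0] * n                      # cumulative losses s_{<t}
    total = 0.0
    for t, loss_t in enumerate(losses, start=1):
        if resample:
            q = draw()
        e = eta(t)
        leader = min(range(n), key=lambda i: s[i] + (k[i] - q[i]) / e)
        total += loss_t[leader]
        for i in range(n):
            s[i] += loss_t[i]
    return total
```

Both variants have the same expected loss on a fixed sequence; only the independent variant hides its future perturbations from an adversary that observes past decisions.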

Oblivious versus adaptive adversary. Recall the protocol for FPL: After each
expert $i$ made its prediction $y_t^i$, and FPL combined them to form its own prediction
$y_t^{FPL}$, we observe $x_t$, and $\text{Loss}(x_t, y_t^{\ldots})$ is revealed for FPL's and each expert's
prediction. For independent randomization, we have $y_t^{FPL} = y_t^{FPL}(x_{<t}, y_{1:t}, q_t)$. For
an oblivious (non-adaptive) adversary, $x_t = x_t(x_{<t}, y_{<t})$. Recursively inserting and
eliminating the experts $y_t^i = y_t^i(x_{<t}, y_{<t})$ and $y_t^{FPL}$, we get the dependencies

$$u_t := \text{Loss}(x_t, y_t^{FPL}) = u_t(x_{1:t}, q_t) \quad\text{and}\quad s_t^i := \text{Loss}(x_t, y_t^i) = s_t^i(x_{1:t}), \qquad (14)$$

where $x_{1:T}$ is a "fixed" sequence. With this notation, Theorems 5-8 read $\ell_{1:T} =
E[\sum_{t=1}^T u_t(x_{1:t}, q_t)] \le f(x_{1:T})$ for all $x_{1:T} \in \mathcal X^T$, where $f(x_{1:T})$ is one of the r.h.s. in
Theorems 5-8. Noting that $f$ is independent of $q_{1:T}$, we can write this as

$$A_1 \le 0, \quad\text{where}\quad A_t(x_{<t}, q_{<t}) := \max_{x_{t:T}} E_{q_{t:T}}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau}, q_\tau) - f(x_{1:T})\Big], \qquad (15)$$

where $E_{q_{t:T}}$ is the expectation w.r.t. $q_t \ldots q_T$ (keeping $q_{<t}$ fixed).

For an adaptive adversary, $x_t = x_t(x_{<t}, y_{<t}, y^{FPL}_{<t})$ can additionally depend on $y^{FPL}_{<t}$.
Eliminating $y_t^i$ and $y_t^{FPL}$ we get, again, (14), but $x_t = x_t(x_{<t}, q_{<t})$ is no longer fixed,
but an (arbitrary) random function. So we have to replace $x_t$ by $x_t(x_{<t}, q_{<t})$ in (15)
for $t = 1..T$. The maximization is now a functional maximization over all functions
$x_t(\cdot,\cdot) \ldots x_T(\cdot,\cdot)$. Using "$\max_{x(\cdot)} E_q[g(x(q), q)] = E_q \max_x[g(x, q)]$" we can write this as

$$B_1 \le 0, \quad\text{where}\quad B_t(x_{<t}, q_{<t}) := \max_{x_t} E_{q_t} \ldots \max_{x_T} E_{q_T}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau}, q_\tau) - f(x_{1:T})\Big]. \qquad (16)$$

So, establishing $B_1 \le 0$ would show that all bounds also hold in the adaptive case.
Lemma 12 (Adaptive=Oblivious) Let $q_1 \ldots q_T \in \mathbb R^T$ be independent random variables,
$E_{q_t}$ be the expectation w.r.t. $q_t$, $f$ any function of $x_{1:T} \in \mathcal X^T$, and $u_t$ arbitrary
functions of $x_{1:t}$ and $q_t$. Then, $A_t(x_{<t}, q_{<t}) = B_t(x_{<t}, q_{<t})$ for all $1 \le t \le T$, where $A_t$
and $B_t$ are defined in (15) and (16). In particular, $A_1 \le 0$ implies $B_1 \le 0$.



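Proof. We prove $B_t = A_t$ by induction on $t$, which establishes the lemma. $B_T = A_T$
is obvious. Assume $B_t = A_t$. Then

$$\begin{aligned}
B_{t-1} &= \max_{x_{t-1}} E_{q_{t-1}} B_t = \max_{x_{t-1}} E_{q_{t-1}} A_t \\
&= \max_{x_{t-1}} E_{q_{t-1}}\Big[\max_{x_{t:T}} E_{q_{t:T}}\Big[\sum_{\tau=1}^T u_\tau(x_{1:\tau}, q_\tau) - f(x_{1:T})\Big]\Big] \\
&= \max_{x_{t-1}} E_{q_{t-1}}\Big[\underbrace{\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau}, q_\tau)}_{\text{independent of } x_{t:T} \text{ and } q_{t:T}} + \underbrace{\max_{x_{t:T}} E_{q_{t:T}}\Big[\sum_{\tau=t}^T u_\tau(x_{1:\tau}, q_\tau) - f(x_{1:T})\Big]}_{\text{independent of } q_{t-1}, \text{ since the } q_t \text{ are i.d.}}\Big] \\
&= \max_{x_{t-1}}\Big\{E_{q_{t-1}}\Big[\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau}, q_\tau)\Big] + \max_{x_{t:T}} E_{q_{t:T}}\Big[\sum_{\tau=t}^T u_\tau(x_{1:\tau}, q_\tau) - f(x_{1:T})\Big]\Big\} \\
&= \max_{x_{t-1}} \max_{x_{t:T}} E_{q_{t:T}} E_{q_{t-1}}\Big[\sum_{\tau=1}^{t-1} u_\tau(x_{1:\tau}, q_\tau) + \sum_{\tau=t}^T u_\tau(x_{1:\tau}, q_\tau) - f(x_{1:T})\Big] \\
&= A_{t-1}.
\end{aligned}$$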



□ 



Corollary 13 (FPL Bounds for adaptive adversary) Theorems 5-8 also hold
for an adaptive adversary in case of independent randomization $q \leadsto q_t$.

Lemma 12 shows that every bound of the form $A_1 \le 0$ proven for an oblivious
adversary, implies an analogous bound $B_1 \le 0$ for an adaptive adversary. Note that
this strong statement holds only for the full observation game, i.e. if after each
time step we learn all losses. In partial observation games such as the Bandit case
[ACBFS95], our actual action may depend on our past action by means of our
past observation, and the assertion no longer holds. In this case, FPL with an
adaptive adversary can be analyzed as shown by [MB04, PH05]. Finally, $y_t^{IFPL}$ can
additionally depend on $x_t$, but the "reduced" dependencies (14) are the same as for
FPL, hence, IFPL bounds also hold for adaptive adversary.



9 Miscellaneous 



Bounds with high probability. We have derived several bounds for the expected
loss $\ell_{1:T}$ of FPL. The actual loss at time $t$ is $u_t = M(s_{<t} + \frac{k-q}{\eta_t})\circ s_t$. A simple Markov
inequality shows that the total actual loss $u_{1:T}$ exceeds the total expected loss $\ell_{1:T} =
E[u_{1:T}]$ by a factor of $c > 1$ with probability at most $1/c$:

$$P[u_{1:T} \ge c\cdot\ell_{1:T}] \le 1/c.$$

Randomizing independently for each $t$ as described in the previous Section, the
actual loss is $u_t = M(s_{<t} + \frac{k-q_t}{\eta_t})\circ s_t$ with the same expected loss $\ell_{1:T} = E[u_{1:T}]$ as
before. The advantage of independent randomization is that we can get a much
better high-probability bound. We can exploit a Chernoff-Hoeffding bound [McD89,
Cor. 5.2b], valid for arbitrary independent random variables $0 \le u_t \le 1$ for $t = 1, \ldots, T$:

$$P\big[|u_{1:T} - E[u_{1:T}]| \ge \delta E[u_{1:T}]\big] \le 2\exp\big(-\tfrac13\delta^2 E[u_{1:T}]\big), \quad 0 \le \delta \le 1.$$

For $\delta = \sqrt{3c/\ell_{1:T}}$ we get

$$P\big[|u_{1:T} - \ell_{1:T}| \ge \sqrt{3c\,\ell_{1:T}}\big] \le 2e^{-c} \quad\text{as soon as}\quad \ell_{1:T} \ge 3c. \qquad (17)$$

Using (17), the bounds for $\ell_{1:T}$ of Theorems 5-8 can be rewritten to yield similar
bounds with high probability $(1 - 2e^{-c})$ for $u_{1:T}$ with small extra regret $\propto \sqrt{c\cdot L}$ or
$\propto \sqrt{c\cdot s^i_{1:T}}$. Furthermore, (17) shows that with probability 1, $u_{1:T}/\ell_{1:T}$ converges
rapidly to 1 for $\ell_{1:T} \to \infty$. Hence we may use the easier to compute $\eta_t = \sqrt{K/2u_{<t}}$
instead of $\eta_t = \sqrt{K/2(\ell_{<t} + 1)}$, likely with similar bounds on the regret.
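
To see how much (17) gains over the Markov bound, the following Python sketch evaluates both for a given expected loss. The function names are our own, and the example numbers in the usage note are hypothetical, not results from the paper.

```python
import math

def high_probability_band(expected_loss, c):
    """Deviation radius and failure probability as in (17).

    Returns (radius, failure_prob): under independent per-round randomization,
    |u - expected_loss| >= radius has probability at most failure_prob.
    Requires expected_loss >= 3c so that delta <= 1.
    """
    if expected_loss < 3.0 * c:
        raise ValueError("bound (17) needs expected_loss >= 3c")
    delta = math.sqrt(3.0 * c / expected_loss)
    radius = delta * expected_loss                  # equals sqrt(3 c * expected_loss)
    failure_prob = 2.0 * math.exp(-delta ** 2 * expected_loss / 3.0)   # equals 2 e^{-c}
    return radius, failure_prob

def markov_failure_prob(expected_loss, radius):
    # Markov: P[u >= factor * E[u]] <= 1 / factor, with factor = 1 + radius / E[u]
    return expected_loss / (expected_loss + radius)
```

For example, with expected loss 300 and c = 3 the radius is about 52 and the failure probability is $2e^{-3} \approx 0.1$, whereas Markov's inequality for the same upward deviation only gives about 0.85.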

Computational Aspects. It is easy to generate the randomized decision of FPL.
Indeed, only a single initial exponentially distributed vector $q \in \mathbb R^n$ is needed. Only
for self-confident $\eta_t \propto 1/\sqrt{\ell_{<t}}$ (see Theorem 7) we need to compute expectations
explicitly. Given $\eta_t$, from $t \leadsto t + 1$ we need to compute $\ell_t$ in order to update $\eta_t$.
Note that $\ell_t = w_t\circ s_t$, where $w_t^i = P[I_t = i]$ and $I_t := \arg\min_{i\in\mathcal E}\{s^i_{<t} + \frac{k^i - q^i}{\eta_t}\}$ is the actual
(randomized) prediction of FPL. With $s := s_{<t} + k/\eta_t$, $P[I_t = i]$ has the following
representation:

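$$\begin{aligned}
P[I_t = i] &= P\Big[s^i - \frac{q^i}{\eta_t} \le s^j - \frac{q^j}{\eta_t}\ \ \forall j \ne i\Big] \\
&= \int P\Big[s^i - \frac{q^i}{\eta_t} = m \ \wedge\ s^j - \frac{q^j}{\eta_t} \ge m\ \ \forall j \ne i\Big]\,dm \\
&= \int_{-\infty}^{s^{\min}} \eta_t\,P[q^i = \eta_t(s^i - m)]\cdot\prod_{j\ne i} P[q^j \le \eta_t(s^j - m)]\,dm \\
&= \sum_{M:\{i\}\subseteq M\subseteq\mathcal N} \frac{(-1)^{|M|-1}}{|M|}\,e^{-\eta_t\sum_{j\in M}(s^j - s^{\min})}.
\end{aligned}$$

Here $P[\,\cdot = m]$ and $P[q^i = \cdot\,]$ denote probability densities, $s^{\min} := \min_j s^j$, and $\mathcal N$ is the set of all experts.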



In the last equality we expanded the product and performed the resulting exponential
integrals. For finite $n$, the second to last one-dimensional integral should be
numerically feasible. Once the product $\prod_{j=1}^n(1 - e^{-\eta_t(s^j - m)})$ has been computed in
time $O(n)$, the argument of the integral can be computed for each $i$ in time $O(1)$,
hence the overall time to compute $\ell_t$ is $O(c\cdot n)$, where $c$ is the time to numerically
compute one integral. For infinite n, the last sum may be approximated by the dom- 
inant contributions. Alternatively, one can modify the algorithm by considering only 
a finite pool of experts in each time step; see next paragraph. The expectation may 
also be approximated by (Monte Carlo) sampling I t several times. 
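
Both routes to the selection probabilities $P[I_t = i]$ can be sketched in Python. This is a hedged illustration with names of our own choosing, assuming a finite expert class, exponential perturbations with density $e^{-q}$, and the shifted scores $s = s_{<t} + k/\eta_t$ passed in directly. The inclusion-exclusion sum is exponential in n and only meant for tiny examples. The integral uses a plain midpoint rule after substituting $v = \eta_t(s^{\min} - m)$, with a step count and cutoff chosen arbitrarily.

```python
import math
from itertools import combinations

def selection_probs_exact(s, eta):
    """P[I_t = i] for all i via the inclusion-exclusion sum (exponential in n)."""
    n = len(s)
    s_min = min(s)
    probs = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for r in range(n):                       # r = |M| - 1 experts besides i
            for extra in combinations(others, r):
                members = (i,) + extra
                expo = -eta * sum(s[j] - s_min for j in members)
                total += (-1) ** r / (r + 1) * math.exp(expo)
        probs.append(total)
    return probs

def selection_probs_integral(s, eta, steps=20000, cutoff=40.0):
    """Same probabilities from the one-dimensional integral (midpoint rule).

    With v = eta * (s_min - m) the integral becomes
    int_0^inf exp(-(d_i + v)) * prod_{j != i} (1 - exp(-(d_j + v))) dv,
    where d_j = eta * (s_j - s_min); the tail beyond `cutoff` is below exp(-cutoff).
    """
    s_min = min(s)
    d = [eta * (x - s_min) for x in s]
    h = cutoff / steps
    probs = [0.0] * len(s)
    for step in range(steps):
        v = (step + 0.5) * h
        factors = [1.0 - math.exp(-(dj + v)) for dj in d]
        full = math.prod(factors)                # O(n) product shared by all i
        for i, di in enumerate(d):
            # dividing out factor i yields the product over j != i in O(1)
            probs[i] += h * math.exp(-(di + v)) * full / factors[i]
    return probs
```

The expected loss of the round is then the inner product of these probabilities with the loss vector, $\ell_t = w_t \circ s_t$.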

Recall that approximating $\ell_{<t}$ can be avoided by using $s^{\min}_{<t}$ (Theorem 8) or $u_{<t}$
(bounds with high probability) instead.

Finitized expert pool. In the case of an infinite expert class, FPL has to compute 
a minimum over an infinite set in each time step, which is not directly feasible. One 
possibility to address this is to choose the experts from a finite pool in each time 
step. This is the case in the algorithm of [Gen03], and also discussed by [LW94]. 
For FPL, we can obtain this behavior by introducing an entering time $\tau^i \ge 1$ for
each expert. Then expert $i$ is not considered for $t < \tau^i$. In the bounds, this leads
to an additional — in Theorem 2 and Corollary 3 and a further additional t % in 
the final bounds (Theorems 5-8), since we must add the regret of the best expert 
in hindsight which has already entered the game and the best expert in hindsight 
at all. Selecting $\tau^i = k^i$ implies bounds for FPL with entering times similar to the
ones we derived here. The details and proofs for this construction can be found in 
[PH05]. 

Deterministic prediction and absolute loss. Another use of $w_t$ from the second
last paragraph is the following: If the decision space is $\mathcal D = \Delta$, then FPL may make a
deterministic decision $d = w_t \in \Delta$ at time $t$ with bounds now holding for sure, instead
of selecting $e_i$ with probability $w_t^i$. For example for the absolute loss $s_t^i = |x_t - y_t^i|$
with observation $x_t \in [0,1]$ and predictions $y_t^i \in [0,1]$, a master algorithm predicting
deterministically $w_t\circ y_t \in [0,1]$ suffers absolute loss $|x_t - w_t\circ y_t| \le \sum_i w_t^i|x_t - y_t^i| = \ell_t$, and
hence has the same (or better) performance guarantees as FPL. In general, masters
can be chosen deterministic if prediction space $\mathcal Y$ and loss-function $\text{Loss}(x, y)$ are
convex. For $x_t, y_t^i \in \{0,1\}$, the absolute loss $|x_t - p_t|$ of a master deterministically
predicting $p_t \in [0,1]$ actually coincides with the $p_t$-expected 0/1 loss of a master
predicting 1 with probability $p_t$. Hence a regret bound for the absolute loss also
implies the same regret for the 0/1 loss.
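
The convexity argument can be checked on concrete numbers. The short Python sketch below uses function names of our own and hypothetical example values. It compares the loss of the deterministic prediction $w_t \circ y_t$ with the expected loss of the randomized master for the absolute loss.

```python
import random

def randomized_expected_loss(x, y, w):
    # expected absolute loss when expert i is followed with probability w[i]
    return sum(wi * abs(x - yi) for wi, yi in zip(w, y))

def deterministic_loss(x, y, w):
    # absolute loss of the single deterministic prediction w . y
    return abs(x - sum(wi * yi for wi, yi in zip(w, y)))
```

For binary observations and predictions the two quantities coincide, for instance x = 1, y = (1, 0, 1) and w = (0.5, 0.3, 0.2) give 0.3 in both cases. For real-valued predictions in [0, 1] the deterministic loss can be strictly smaller.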



10 Discussion and Open Problems 

How does FPL compare with other expert advice algorithms? We briefly discuss 
four issues, summarized in Table 1. 

Static bounds. Here the coefficient of the regret term $\sqrt{KL}$, referred to as the
leading constant in the sequel, is 2 for FPL (Theorem 5). It is thus a factor of $\sqrt2$
worse than the Hedge bound for arbitrary loss by [FS97], which is sharp in some
sense [Vov95]. This is the price one pays for FPL and its easy analysis for adaptive
learning rate. There is evidence that this (worst-case) difference really exists and
is not only a proof artifact. For special loss functions, the bounds can sometimes
be improved, e.g. to a leading constant of 1 in the static (randomized) WM case
with 0/1 loss [CB97]². Because of the structure of the FPL algorithm however, it is
questionable if corresponding bounds hold there.

Dynamic bounds. Not knowing the right learning rate in advance usually costs a
factor of $\sqrt2$. This is true for Hannan's algorithm [KV03] as well as in all our cases.
Also for binary prediction with uniform complexities and 0/1 loss, this result has
been established recently: [YEYS04] show a dynamic regret bound with leading
constant $\sqrt2(1+\varepsilon)$. Remarkably, the best dynamic bound for a WM variant proven
by [ACBG02] has a leading constant $2\sqrt2$, which matches ours. Considering the
difference in the static case, we therefore conjecture that a bound with leading
constant of 2 holds for a dynamic Hedge algorithm.

General weights. While there are several dynamic bounds for uniform weights, 
the only previous result for non-uniform weights we know of is [Gen03, Cor. 16], 



which gives the dynamic bound fj^p tllc < s\. T +i + y (s\. T +i)ln(s\. T +i) for a p- 
norm algorithm for the absolute loss. This is comparable to our bound for rapidly 
decaying weights $w^i = \exp(-i)$, i.e. $k^i = i$. Our hierarchical FPL bound in Theorem
9 (b) generalizes this to arbitrary weights and losses and strengthens it, since both, 
asymptotic order and leading constant, are smaller. 

It seems that the analysis of all experts algorithms, including Weighted Majority 
variants and FPL, gets more complicated for general weights together with adaptive 
learning rate, because the choice of the learning rate must account for both the 
weight of the best expert (in hindsight) and its loss. Both quantities are not known 
in advance, but may have a different impact on the learning rate: While increasing 
the current loss estimate always decreases $\eta_t$, the optimal learning rate for an expert
with higher complexity would be larger. On the other hand, all analyses known so
far require decreasing $\eta_t$. Nevertheless we conjecture that the bounds $\propto \sqrt{Tk^i}$ and
$\propto \sqrt{s^i_{1:T}k^i}$ also hold without the hierarchy trick, probably by using expert dependent
learning rate $\eta_t^i$.

Comparison to Bayesian sequence prediction. We can also compare the worst-
case bounds for FPL obtained in this work to similar bounds for Bayesian sequence
prediction. Let $\{\nu_i\}$ be a class of probability distributions over sequences and assume
that the true sequence is sampled from $\mu \in \{\nu_i\}$ with complexity $k^\mu$ ($\sum_i e^{-k^{\nu_i}} \le 1$).
Then it is known that the Bayes optimal predictor based on the $e^{-k^{\nu_i}}$-weighted
mixture of $\nu_i$'s has an expected total loss of at most $L^\mu + 2\sqrt{L^\mu k^\mu} + 2k^\mu$, where
$L^\mu$ is the expected total loss of the Bayes optimal predictor based on $\mu$ [Hut03a,



2 While FPL and Hedge and WMR [LW94] can sample an expert without knowing its prediction, 
[CB97] need to know the experts' predictions. Note also that for many (smooth) loss-functions like 
the quadratic loss, finite regret can be achieved [Vov90]. 






$\eta$        Loss   conjecture   Low.Bnd.              Upper Bound
static    0/1    1            1?                    1 [CB97]
static    any    $\sqrt2$!    $\sqrt2$ [Vov95]      $\sqrt2$ [FS97, Hedge], 2 [FPL]
dynamic   0/1    $\sqrt2$     1? [Hut03b]           $\sqrt2$ [YEYS04], $2\sqrt2$ [ACBG02, WM-Type?]
dynamic   any    2            $\sqrt2$ [Vov95]      $2\sqrt2$ [FPL], 2 [Hut03b, Bayes]

Table 1: Comparison of the constants $c$ in regrets $c\sqrt{\text{Loss}\times\ln n}$ for various settings
and algorithms.



Thm.2], [Hut04b, Thm.3.48]. Using FPL, we obtained the same bound except for 
the leading order constant, but for any sequence independently of the assumption 
that it is generated by $\mu$. This is another indication that a PEA bound with leading
constant 2 could hold. See [Hut04a], [Hut03b, Sec.6.3] and [Hut04b, Sec.3.7.4] for a 
more detailed comparison of Bayes bounds with PEA bounds. 